Assorted, Archetypal and Annotated Two Million (3A2M) Cooking Recipes Dataset based on Active Learning
Cooking recipes allow individuals to exchange culinary ideas and provide food
preparation instructions. Due to a lack of adequate labeled data, categorizing
raw recipes found online to the appropriate food genres is a challenging task
in this domain. Utilizing the knowledge of domain experts to categorize recipes
could be a solution. In this study, we present a novel dataset of two million
culinary recipes labeled in respective categories leveraging the knowledge of
food experts and an active learning technique. To construct the dataset, we
collect the recipes from the RecipeNLG dataset. Then, we employ three human
experts, each with a trustworthiness score above 86.667%, to categorize 300K
recipes based on their Named Entity Recognition (NER) tags, assigning each
recipe to one of nine categories: bakery, drinks, non-veg, vegetables, fast
food, cereals, meals, sides, and fusion. Finally, we categorize the remaining
1900K recipes using an Active Learning method that blends Query-by-Committee
and Human-In-The-Loop (HITL) approaches. Each of the more than two million
recipes in our dataset is categorized and carries an associated confidence
score. Across the nine genres, the Fleiss' Kappa score of this dataset is
0.56026. We
believe that the research community can use this dataset to perform various
machine learning tasks such as recipe genre classification, recipe generation
of a specific genre, new recipe creation, etc. The dataset can also be used to
train and evaluate the performance of various NLP tasks such as named entity
recognition, part-of-speech tagging, semantic role labeling, and so on. The
dataset will be available upon publication: https://tinyurl.com/3zu4778y
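The Query-by-Committee selection described above can be sketched as follows; the keyword rules below are hypothetical stand-ins for the paper's actual committee models, the recipes are invented, and vote entropy is one standard disagreement measure:

```python
# A minimal sketch of Query-by-Committee sample selection with vote
# entropy; the keyword rules below are hypothetical stand-ins for the
# paper's actual committee models, and the recipes are invented.
import math
from collections import Counter

GENRES = ["bakery", "drinks", "non-veg", "vegetables", "fast food",
          "cereals", "meals", "sides", "fusion"]  # the nine genres

def rule_a(ner):
    if "flour" in ner: return "bakery"
    if "chicken" in ner: return "non-veg"
    return "meals"

def rule_b(ner):
    if "sugar" in ner: return "bakery"
    if "chicken" in ner: return "meals"
    return "sides"

def rule_c(ner):
    if "flour" in ner or "sugar" in ner: return "bakery"
    return "vegetables"

COMMITTEE = [rule_a, rule_b, rule_c]

def vote_entropy(ner):
    """Committee disagreement on one unlabeled recipe (0 = unanimous)."""
    votes = Counter(clf(ner) for clf in COMMITTEE)
    n = len(COMMITTEE)
    return -sum((c / n) * math.log(c / n) for c in votes.values())

def select_for_human(recipes):
    """Pick the recipe the committee disagrees on most; in a HITL
    setup, this one is routed to a human expert for labeling."""
    return max(recipes, key=lambda r: vote_entropy(r["ner"]))

recipes = [
    {"title": "Pound Cake", "ner": ["flour", "sugar", "butter"]},
    {"title": "Grilled Chicken", "ner": ["chicken", "salt"]},
]
picked = select_for_human(recipes)  # the unanimous "Pound Cake" is skipped
```

Recipes on which the committee agrees keep the committee's label with a high confidence score; only the most contested ones need a human pass.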
Towards Automated Recipe Genre Classification using Semi-Supervised Learning
Sharing cooking recipes is a great way to exchange culinary ideas and provide
instructions for food preparation. However, categorizing raw recipes found
online into appropriate food genres can be challenging due to a lack of
adequate labeled data. In this study, we present a dataset named the
"Assorted, Archetypal, and Annotated Two Million Extended (3A2M+) Cooking
Recipe Dataset" that contains two million culinary recipes labeled in
respective categories with extended named entities extracted from recipe
descriptions. This collection of data includes various features such as title,
NER, directions, and extended NER, as well as nine different labels
representing genres including bakery, drinks, non-veg, vegetables, fast food,
cereals, meals, sides, and fusions. The proposed 3A2M+ pipeline extends the
Named Entity Recognition (NER) list, using two NER extraction tools to recover
named entities such as heat, time, or process that are missing from the recipe
directions. The 3A2M+ dataset provides a comprehensive solution to
various challenging recipe-related tasks, including classification, named
entity recognition, and recipe generation. Furthermore, we have applied
traditional machine learning, deep learning, and pre-trained language models
to classify the recipes into their corresponding genres, achieving an overall
accuracy of 98.6%. Our investigation indicates that the title feature plays a
more significant role than the other features in classifying the genre.
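As a rough illustration of how the title alone can signal the genre, consider the toy word-overlap scorer below; the training titles are invented, and the paper's reported accuracy comes from real ML/DL and pre-trained language models, not from anything this simple:

```python
# A toy word-overlap scorer showing how the title alone can signal the
# genre; the training titles are invented, and the paper's reported
# 98.6% accuracy comes from pre-trained language models, not from this.
from collections import Counter, defaultdict

TRAIN = [
    ("chocolate chip cookies", "bakery"),
    ("banana bread", "bakery"),
    ("iced lemon tea", "drinks"),
    ("mango smoothie", "drinks"),
]

# Per-genre word frequencies from the training titles.
word_counts = defaultdict(Counter)
for title, genre in TRAIN:
    word_counts[genre].update(title.split())

def classify(title):
    """Score each genre by how often its training titles share words
    with the input title, and return the best-scoring genre."""
    words = title.lower().split()
    scores = {g: sum(c[w] for w in words) for g, c in word_counts.items()}
    return max(scores, key=scores.get)
```

Even this crude overlap count separates the two genres on held-out titles, which hints at why the title feature carries so much signal for stronger models.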
Computational Sarcasm Analysis on Social Media: A Systematic Review
Sarcasm can be defined as saying or writing the opposite of what one truly
wants to express, usually to insult, irritate, or amuse someone. Because of the
obscure nature of sarcasm in textual data, detecting it is difficult and of
great interest to the sentiment analysis research community. Though the
research in sarcasm detection spans more than a decade, some significant
advancements have been made recently, including employing unsupervised
pre-trained transformers in multimodal environments and integrating context to
identify sarcasm. In this study, we aim to provide a brief overview of recent
advancements and trends in computational sarcasm research for the English
language. We describe relevant datasets, methodologies, trends, issues,
challenges, and sarcasm-related tasks that go beyond detection. Our study
provides well-summarized tables of sarcasm datasets, sarcastic features and
their extraction methods, and performance analysis of various approaches which
can help researchers in related domains understand current state-of-the-art
practices in sarcasm detection.

Comment: 50 pages, 3 tables. Submitted to 'Data Mining and Knowledge
Discovery' for possible publication.
BenLLMEval: A Comprehensive Evaluation into the Potentials and Pitfalls of Large Language Models on Bengali NLP
Large Language Models (LLMs) have emerged as one of the most important
breakthroughs in natural language processing (NLP) for their impressive skills
in language generation and other language-specific tasks. Though LLMs have been
evaluated in various tasks, mostly in English, they have not yet undergone
thorough evaluation in under-resourced languages such as Bengali (Bangla). In
this paper, we evaluate the performance of LLMs for the low-resourced Bangla
language. We select various important and diverse Bangla NLP tasks, such as
abstractive summarization, question answering, paraphrasing, natural language
inference, text classification, and sentiment analysis for zero-shot evaluation
with ChatGPT, LLaMA-2, and Claude-2 and compare the performance with
state-of-the-art fine-tuned models. Our experimental results demonstrate the
inferior performance of LLMs on various Bangla NLP tasks, calling for further
effort to develop a better understanding of LLMs in low-resource languages
like Bangla.

Comment: First two authors contributed equally.
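The zero-shot setup described above can be sketched as a simple evaluation loop; `ask_llm` is a hypothetical stub standing in for a real ChatGPT, LLaMA-2, or Claude-2 API call, and the prompt wording is illustrative, not the paper's actual prompt:

```python
# A minimal zero-shot evaluation loop; `ask_llm` is a hypothetical stub
# standing in for a real ChatGPT / LLaMA-2 / Claude-2 API call, and the
# prompt wording is illustrative, not the paper's actual prompt.
def ask_llm(prompt):
    # Placeholder heuristic: a real implementation would call a model
    # API here ("ভালো" means "good" in Bangla).
    return "positive" if "ভালো" in prompt else "negative"

def zero_shot_accuracy(examples):
    """examples: list of (bangla_text, gold_label) pairs; each text is
    sent to the model with a zero-shot instruction, no fine-tuning."""
    correct = 0
    for text, gold in examples:
        prompt = ("Classify the sentiment of this Bangla text as "
                  f"positive or negative: {text}")
        if ask_llm(prompt).strip().lower() == gold:
            correct += 1
    return correct / len(examples)
```

The same loop, with the task instruction swapped, covers the other tasks the paper evaluates (summarization, QA, NLI, and so on), each scored against its fine-tuned baseline.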
BanglaBook: A Large-scale Bangla Dataset for Sentiment Analysis from Book Reviews
The analysis of consumer sentiment, as expressed through reviews, can provide
a wealth of insight regarding the quality of a product. While the study of
sentiment analysis has been widely explored in many popular languages,
relatively less attention has been given to the Bangla language, mostly due to
a lack of relevant data and cross-domain adaptability. To address this
limitation, we present BanglaBook, a large-scale dataset of Bangla book reviews
consisting of 158,065 samples classified into three broad categories: positive,
negative, and neutral. We provide a detailed statistical analysis of the
dataset and employ a range of machine learning models to establish baselines
including SVM, LSTM, and Bangla-BERT. Our findings demonstrate a substantial
performance advantage of pre-trained models over models that rely on manually
crafted features, emphasizing the necessity for additional training resources
in this domain. Additionally, we conduct an in-depth error analysis by
examining sentiment unigrams, which may provide insight into common
classification errors in under-resourced languages like Bangla. Our code and
data are publicly available at https://github.com/mohsinulkabir14/BanglaBook.

Comment: Accepted in ACL Findings 2023.
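The sentiment-unigram inspection mentioned above can be illustrated with a minimal sketch; the romanized toy reviews below are invented ("darun" roughly "wonderful", "baje" roughly "bad"), not samples from BanglaBook:

```python
# A minimal sketch of sentiment-unigram inspection; the romanized toy
# reviews are invented ("darun" ~ "wonderful", "baje" ~ "bad"), not
# samples from BanglaBook.
from collections import Counter

reviews = [
    ("darun ekta boi", "positive"),
    ("darun lekha", "positive"),
    ("baje boi", "negative"),
    ("khub baje", "negative"),
]

by_label = {"positive": Counter(), "negative": Counter()}
for text, label in reviews:
    by_label[label].update(text.split())

def top_unigram(label):
    """The most frequent unigram for a class; words that dominate one
    class but also appear in the other often explain misclassifications."""
    return by_label[label].most_common(1)[0][0]
```

Ranking unigrams per class this way surfaces words like "boi" ("book") that appear in both classes, which is exactly the kind of ambiguity an error analysis would flag.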